This paper introduces an Intelligent Salat Monitoring and Training System based on machine vision
and image processing. Prayer (salat) is the second pillar of Islam. It is the most important and
fundamental act of worship, which believers have to perform five times a day. In terms of gestures,
salat consists of predefined postures that must be performed in a precise manner. Many training and
correction materials are available on the internet and social media. However, some people do not
perform these postures correctly because they are new to salat or have learned the prayer
incorrectly. Furthermore, the time spent in each posture has to be balanced. To address these
issues, we propose an assistive intelligent framework that guides worshippers in evaluating the
correctness of their prayer postures. Image comparison and pattern matching are used to study the
system’s effectiveness by combining several algorithms, such as Euclidean distance, template
matching and grey-level correlation, to compare the images of the user with those in the database.
The experimental results for both correct and incorrect salat performances are shown via pictures
and graphs for each of the postures of salat.

Keywords: Image processing; Intelligent; Monitoring; Salat (Islamic prayer); Salat posture

1. INTRODUCTION

Salat (prayer) is one of the main pillars of Islam and is considered one of the most important
aspects of our faith. Our beloved Prophet Muhammad (peace be upon him) received the commandment of
salat during Isra’ and Mi’raj (the night journey). This gave hope to humanity once more, as people
had lost the light on how to worship the one true God, The Almighty Allah (Glorious is He and He is
Exalted). At that time, Muslims learned how to perform salat by following the orders and actions of
Prophet Muhammad (peace be upon him), observing them directly through the senses of sight and
hearing. In other words, salat was learned solely by using the human senses to detect the correct
movements and words. Although Muslims at that time could learn salat only through the Prophet’s
words and actions, the teaching was highly effective, as Muslims today still perform salat in the
same way as the Prophet.

As the world grows older, some Muslims tend to forget the proper way to perform salat as they
become bound to worldly matters. Today, technology is improving at a very fast pace, and much
research and development has been conducted to improve our lives. This places a great
responsibility on Muslim researchers to develop technology that benefits this world and the
hereafter. With this goal, the “Intelligent Salat Monitoring and Training System”, developed as an
educational tool, could help many Muslims learn and recognize the proper way of performing salat.
A few studies have addressed the monitoring of salat activity. Alobaid and Rasheed [1] and
Al-Ghannam and Dossari [2] used smartphone technology to recognize salat activities. Rabbi et al.
[3] and Ibrahim and Ahmad [4] assessed the activities of salat using electromyographic (EMG)
signals. In this work, the key technologies that we need to learn and master to build the salat
inspection and training system for the Muslim community are machine vision and image processing.

Computer vision is used to inspect and track human movement in various fields, such as sport,
health care and even games. As Muslims, we usually learn how to perform salat by following others.
We follow the postures and movements of others mainly by observing them with our eyes, and then
analyze and judge whether what we have learned is correct or wrong. Combining computer vision with
this religious practice offers many advantages. Using the system, we can learn the proper way of
salat by looking at the correct postures stored in the system database, and the system gives proper
feedback to the user. The feedback can be in the form of words and numbers indicating the
percentage of error in the salat movement.

Many researchers have developed algorithms to detect human body parts, such as the face and hands,
as well as movements and postures. Some of the algorithms can detect the posture of the human body
[5]. Different algorithms impose different requirements on the system. Therefore, several
considerations must be made to obtain a fully functional system. The approaches and algorithms for
the salat inspection and training system must be chosen so that the images to be measured and
compared do not lack the information needed by the system. Elements such as the angle of sight,
size, color, and texture of the image need to be measured using multiple algorithms to obtain an
accurate result, so that the system does not give wrong feedback to the user. MATLAB is used to
implement and test the proposed methods.

Muslim communities and new converts to Islam across the world are in dire need of basic knowledge
of salat. The salat inspection and training system can teach and share this knowledge with ease.
The system addresses several problems: i) it helps Muslims across the world learn the correct way
of salat anytime and anywhere; ii) it reduces the time, cost, and space needed for learning salat;
iii) it protects learners from being misled by fake preachers; iv) it supports Muslims who feel
embarrassed to learn salat from others; and v) it helps newly converted Muslims learn salat with
ease.

Several movements in salat are considered important [6]. These movements and postures need to be
performed correctly so that Allah will accept our salat. Notably, salat consists of a sequence of
movements that form one complete cycle, known as a raka’ah. The sequence of one complete cycle is
shown in Figure 1.


Figure 1. Sequences for one complete cycle in salat [6] 


2. RELATED WORKS 
2.1. Human body modelling in machine vision 

Human motion and pose recognition methods can be categorized into two types, model-based and
appearance-based. Model-based object tracking algorithms are based on simple CAD (computer-aided
design) wire models of objects, as shown in Figure 2. Using this kind of model, the start and end
points of the lines can be projected correctly onto the image plane, which allows real-time
tracking of objects at a small computational cost. Appearance-based methods use no a priori
knowledge of the data present in the image. Instead, they analyze the data using the statistics of
the dataset available in the database to extract the modes, and thereby group the data in the best
possible way. According to Azad et al., the appearance-based method uses various algorithms to
illustrate the object [7]. In other words, appearance-based approaches are more reliable in many
types of situations because they do not require a model of a specific object.


Figure 2. Illustration of the object using wire model [7] 


2.2. Representation of human figure 
2.2.1. Bounding box 

One of the simplest representations of the human body is the bounding box. Although its function
is limited, the model is useful when the human body occupies a very small part of the picture
because it uses only a few pixels. This reduces the complexity of image processing, but at the cost
of accuracy. Figure 3 shows the bounding box as a human representation in human body modelling [8].


Figure 3. Bounding box representation of human body [8] 


2.2.2. Stick representation 

The stick or bone figure is a typical representation of the human body in machine vision and image
processing. Each stick acts as a bone, and together the sticks form the pose or movement of the
human. Figure 4 shows the stick representation of the human body. The disadvantage of this
representation is that some postures, such as sitting, are difficult to represent because of
occlusion [9].


2.2.3. Multi-dimensional representation 


Hand gestures are one of the important ways for humans to convey information and express intuitive
intention. They offer a high degree of differentiation, strong flexibility and high efficiency of
information transmission, which makes hand gesture recognition (HGR) one of the research hotspots
in the field of human-machine interface (HMI) [10]. The two-dimensional (2D) contour representation
projects the human body from three-dimensional (3D) space onto the two-dimensional image plane
[11]-[13]. It approximates the human body using deformable contours, ribbons, or cardboards [14].
Figure 5 shows 2D images of the hand. The three-dimensional (3D) representation describes the parts
of the human body in 3D space using a combination of cylinders, as shown in Figure 6 [15]. The 3D
representation can also use other shapes, such as cones or spheres, to represent the human body.
Figure 4. Stick figure representation of human body [9]


Figure 5. Pictures of 2D hand [12] 


Figure 6. 3D human motion representation [15] 


Intelligent system for Islamic prayer (salat) posture monitoring (Md Mozasser Rahman) 


224 o ISSN: 2252-8938 


2.3. Algorithm for pattern and image matching 
2.3.1. Histogram of the oriented gradient 

A histogram of oriented gradients (HOG) can be used for image matching purposes such as face
recognition [16]-[18]. HOG describes an image by the distribution of gradient orientations within
local regions of the image. It is very useful for matching and tracking textured objects, which
have inorganic shapes. In computer vision applications involving human hands, HOG has been reported
to perform better than other feature extraction models [19]. HOG has also been applied to base
feature images to generate feature descriptors [20].
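
As a rough illustration of the idea behind HOG, the sketch below builds a magnitude-weighted
histogram of gradient orientations for a single cell of a grayscale image. It is written in plain
Python for readability and is not the MATLAB implementation used in this work. The function name
`cell_orientation_histogram`, the nine unsigned orientation bins and the central-difference
gradient are assumptions made for illustration; a full HOG descriptor would additionally group
cells into blocks and normalize them.

```python
import math

def cell_orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 degrees)."""
    rows, cols = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    # Interior pixels only, so that central differences are always defined.
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]
            gy = cell[r + 1][c] - cell[r - 1][c]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The extra modulo guards against an angle that rounds up to exactly 180.
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist

# Hypothetical cell with a vertical edge: intensity changes only horizontally.
edge = [[0, 0, 255, 255]] * 4
hist = cell_orientation_histogram(edge)
```

For this edge, all gradient votes fall into the first orientation bin, which is what makes the
histogram a compact description of the local shape.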


2.3.2. Hidden Markov model

The hidden Markov model (HMM) is widely used in speech recognition, and it is a kind of model that
uses statistics to extract features. According to Wang et al., HMM is more reliable in analyzing
time-varying data with variations in space-time conditions [21]. In the matching procedure, the
probability that the HMM generates the test symbol sequence, which corresponds to the features of
the input images, is computed. HMM is considered one of the best algorithms for matching human
motion patterns because its stochastic framework can handle uncertainty [22]. However, this method
has a significant disadvantage: the HMM is inefficient in handling three or more independent
processes [23].
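
The matching step described above, computing the probability that a model generates an observed
symbol sequence, is commonly done with the forward algorithm. The following Python sketch shows
that computation for a discrete HMM. The two-state model and all of its probabilities are
hypothetical values chosen for illustration and are not taken from the cited works.

```python
def sequence_likelihood(start, trans, emit, observations):
    """Forward algorithm: P(observations | model) for a discrete HMM."""
    n_states = len(start)
    # alpha[s] = probability of the symbols seen so far, ending in state s.
    alpha = [start[s] * emit[s][observations[0]] for s in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)

# Hypothetical two-state model with two observable symbols (0 and 1).
start = [0.6, 0.4]                    # initial state probabilities
trans = [[0.7, 0.3], [0.4, 0.6]]      # trans[p][s] = P(next state s | state p)
emit = [[0.9, 0.1], [0.2, 0.8]]       # emit[s][k] = P(symbol k | state s)
likelihood = sequence_likelihood(start, trans, emit, [0, 1])
```

In a recognition setting, one such model would be trained per motion class, and the class whose
model gives the highest likelihood for the observed sequence would be selected.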


2.3.3. Euclidean distance 

Euclidean distance can define a metric on images efficiently. It measures the straight-line
distance between two points in Euclidean space. According to Wang et al., for images this amounts
to a summation of the pixel-wise intensity differences [24]. They stated that with the traditional
Euclidean distance, a small deformation of an image may result in a large Euclidean distance. To
solve this problem, they proposed a modified image distance. The key advantages of their method are
simplicity of computation, relative insensitivity to small deformations, and the ease with which it
can be embedded in most of the powerful image recognition techniques.


2.3.4. Temporal template 

Bobick and Davis used temporal templates to recognize human movements by constructing a vector
image and matching it against stored representations of known movements in the database [25]. Two
types of features were used, namely the motion-energy image and the motion-history image. This
approach has several advantages: it supports direct recognition of the motion, performs temporal
segmentation automatically, is invariant to linear changes in speed, and runs in real time on a
standard platform. Its limitations are that it cannot handle incidental motion and that occlusion
may occur at certain points.
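
To make the two temporal-template features more concrete, the sketch below shows the usual update
rule for a motion-history image, in which a pixel is set to a duration value where motion is
detected and otherwise decays by one per frame, and derives the motion-energy image from it. The
function names, the duration `tau=3` and the tiny 1x3 image are hypothetical choices for
illustration.

```python
def update_motion_history(history, motion_mask, tau):
    """One motion-history update: moving pixels are set to tau, all others decay by 1."""
    return [
        [tau if moving else max(0, h - 1) for h, moving in zip(h_row, m_row)]
        for h_row, m_row in zip(history, motion_mask)
    ]

def motion_energy(history):
    """Motion-energy image: 1 wherever motion occurred within the last tau frames."""
    return [[1 if h > 0 else 0 for h in row] for row in history]

# Hypothetical 1x3 image: motion sweeps from left to right over three frames.
history = [[0, 0, 0]]
for mask in ([[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]):
    history = update_motion_history(history, mask, tau=3)
energy = motion_energy(history)
```

The most recent motion carries the largest value in the history image, so the direction of the
movement is encoded in the intensity gradient, while the energy image only records where motion
took place.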


3. METHOD 
3.1. Actual picture and mechanical design 

To build the salat inspection and training system, we used polyvinyl chloride (PVC) pipe as the
base of the design. We prioritized portability, since the assembled system requires a large space
to place or store. With PVC pipe, the system can be assembled and disassembled easily in less than
five minutes. The base of the system is built from a combination of plain-ended pipes, equal
elbows, end caps, and equal tees. Figure 7(a) shows the actual picture of the system, and the
isometric view of the system is shown in Figure 7(b). In the actual picture, two black lines are
drawn on the base carpet. The middle black line marks the initial position for at-tawarrok. The
user sits there until the system finishes the inspection. The black line located near the back
camera marks the initial position for takbiratul ihram, ruku’ and sujud. The user performs each of
these postures at its respective initial position, marked by the black lines. Two cameras are
installed in the system, as shown in Figure 7(b). One camera is installed at the front, higher than
the second one, to inspect the front part of the body, such as the hands and head. The second
camera is installed at the back, lower than the first one, to inspect the back part of the body,
such as the legs. The front camera is used to inspect the postures of takbiratul ihram, ruku’ and
sujud, while the back camera is used to inspect the posture during at-tawarrok.

Two servo motors are installed in the system, located below the cameras. Their function is to
change the camera angle when recording the video of the user’s salat. Because of its portable
design, the system can be carried and set up anywhere. A force-sensing resistor is installed in the
base carpet to inspect the user during sujud. It is used to check whether the parts of the body
that must touch the ground, such as the forehead and nose, are actually touching it.

Figure 7. Salat monitoring and training system (a) actual picture and (b) isometric view


3.2. Experimental method

In this study, we adopted a template matching approach because of its simplicity in real-time
applications. To detect the human body, the RGB (red, green, and blue) color space is chosen, and
the luminance and chrominance are then de-correlated. A given RGB image is converted to a grayscale
image using the RGB-to-grayscale conversion equation, in which a weighted combination of the three
color components generates the grayscale value:


$\mathrm{Gray} = \left(0.2126 \times \mathrm{Red}^{2.2} + 0.7152 \times \mathrm{Green}^{2.2} + 0.0722 \times \mathrm{Blue}^{2.2}\right)^{1/2.2}$ (1)
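
The conversion in (1) can be sketched for a single pixel as follows. The sketch reads (1) as a
weighted sum of the channels raised to the power 2.2, followed by the inverse power 1/2.2; this
reading of the exponents, the function name `rgb_to_gray` and the scaling of the channels to the
range 0 to 1 are assumptions made for illustration.

```python
def rgb_to_gray(red, green, blue, gamma=2.2):
    """Gamma-aware luminance of one pixel; channels are assumed to lie in [0, 1]."""
    linear = (0.2126 * red ** gamma
              + 0.7152 * green ** gamma
              + 0.0722 * blue ** gamma)
    return linear ** (1.0 / gamma)

white = rgb_to_gray(1.0, 1.0, 1.0)
green_only = rgb_to_gray(0.0, 1.0, 0.0)
blue_only = rgb_to_gray(0.0, 0.0, 1.0)
```

Because the three weights sum to one, neutral gray pixels keep their value, while a pure green
pixel maps to a brighter gray than a pure blue one.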


The input image is converted to grayscale because it has to be matched against the database images,
which serve as templates. A matching approach must then be chosen. The matching process moves the
template image to all possible positions in a larger source image and computes a numerical index
that indicates how well the template matches the image at each position. One well-known matching
measure is the Euclidean distance. Let $I$ be a gray-level image and $g$ be a gray-value template
of size $(n \times m)$:


$d(I, g, r, c) = \sqrt{\sum_{i=0}^{n-1} \sum_{j=0}^{m-1} \left(I(r+i, c+j) - g(i, j)\right)^2}$ (2)


where $(r, c)$ denotes the top left corner of template $g$.
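
A minimal Python sketch of this sliding-template search is given below. It evaluates the Euclidean
distance between the template and the image patch at every valid top left corner and keeps the
position with the smallest distance. The function names and the small 4x4 test image are
hypothetical and serve only to illustrate the procedure; images are represented as nested lists of
gray values.

```python
import math

def euclidean_distance(image, template, r, c):
    """Distance between the template and the image patch whose top left corner is (r, c)."""
    total = 0.0
    for i, t_row in enumerate(template):
        for j, t_val in enumerate(t_row):
            diff = image[r + i][c + j] - t_val
            total += diff * diff
    return math.sqrt(total)

def best_match(image, template):
    """Slide the template over every valid position; return (distance, r, c) of the best one."""
    n, m = len(template), len(template[0])
    candidates = (
        (euclidean_distance(image, template, r, c), r, c)
        for r in range(len(image) - n + 1)
        for c in range(len(image[0]) - m + 1)
    )
    return min(candidates)

# Hypothetical 4x4 image that contains the 2x2 template at row 1, column 2.
image = [
    [10, 10, 10, 10],
    [10, 10, 200, 50],
    [10, 10, 50, 200],
    [10, 10, 10, 10],
]
template = [[200, 50], [50, 200]]
distance, row, col = best_match(image, template)
```

A distance of zero means a pixel-exact match; in practice the smallest distance is compared with a
threshold, since a live image never reproduces the stored template exactly.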
The second matching process, which has an advantage over Euclidean distance in accuracy and
processing time, is grey-level correlation:

$\mathrm{cor} = \dfrac{\sum_{i=0}^{N-1} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=0}^{N-1} (x_i - \bar{x})^2 \sum_{i=0}^{N-1} (y_i - \bar{y})^2}}$ (3)


where

$x$ is the template gray level image

$\bar{x}$ is the average gray level in the template image

$y$ is the source image section

$\bar{y}$ is the average gray level in the source image

$N$ is the number of pixels in the section image

The value cor lies between -1 and +1, with larger values representing a stronger relationship
between the two images.
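
The grey-level correlation can be sketched in Python as shown below, with the template and the
image section flattened into equally long lists of gray levels. The function name and the sample
values are hypothetical, and returning 0 for a completely flat patch is an assumption made to avoid
a division by zero.

```python
import math

def gray_level_correlation(template, section):
    """Normalized correlation of two equally sized lists of gray levels, in [-1, 1]."""
    n = len(template)
    mean_x = sum(template) / n
    mean_y = sum(section) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(template, section))
    energy_x = sum((x - mean_x) ** 2 for x in template)
    energy_y = sum((y - mean_y) ** 2 for y in section)
    denominator = math.sqrt(energy_x * energy_y)
    if denominator == 0:
        return 0.0  # Assumption: a flat patch carries no pattern, so report no correlation.
    return covariance / denominator

template = [10, 20, 30, 40]
brighter = [60, 70, 80, 90]   # same pattern under a uniform brightness offset
inverted = [40, 30, 20, 10]   # reversed pattern
same_pattern = gray_level_correlation(template, brighter)
opposite_pattern = gray_level_correlation(template, inverted)
```

Because the mean gray level is subtracted from both inputs, a uniform change in brightness does not
lower the score, which is one reason correlation tends to be more robust than the plain Euclidean
distance.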

As we know, the correlation matching result never shows a 100% match because the images differ in
small details. Therefore, a threshold should be applied to the correlation result; the threshold
can be set to the highest match value that occurred. For feature extraction, we considered HOG
descriptors as one descriptor, as shown in Figure 8, to show how the system could perform. Figure
8(a) shows the equivalent histogram of an image, and the obtained HOG feature of the image is shown
in Figure 8(b). Many other descriptors can also be used or combined. Other descriptors apply the
same operation of converting the image pixels into votes according to their gray-level value in the
range 0-255, with 0 for black and 255 for white. Dividing the feature region into log-polar bins
instead of squares is another commonly used approach. To identify an image, the computer needs
descriptors; the more image descriptors the system is trained with, the lower the system error.
Combining multiple types of descriptors, such as the scale-invariant feature transform (SIFT),
gradient location orientation histogram (GLOH) and speeded up robust features (SURF), will help to
enhance the performance of the system.

Figure 8. Process of calculating HOG (a) creating histogram from image and (b) visualization of HOG
features of the image


A sample HOG representation of both the correct and the wrong position during takbiratul ihram is
shown in Figure 9. The correct position, with the hands raised above the shoulders, is shown with
its descriptor in Figure 9(a), whereas the wrong position, with the hands below the shoulders, is
shown in Figure 9(b). The descriptor on the right side of each image shows the intensity variation
in the image.


Figure 9. Comparison of the correct and wrong positions of takbiratul ihram using the HOG
representation of the image (a) correct position and (b) wrong position


Both HGR and HOG descriptors were used for matching the two positions, as shown in Figure 10. The
error of the result can be recognized by the difference between the two extreme points. The
difference in strong corners between the two overlaid images represents the amount of unmatched
features, or the error, between the two matched images. The larger this difference, the more likely
it is that the worshipper has performed the salat position incorrectly. Therefore, we need to
extract more features by increasing the number of corners, and to threshold the matching result so
that the system flags the position as wrong if the difference between the two images exceeds 30%.


Figure 10. The result of the matching process using HGR and HOG 
4. RESULTS AND DISCUSSION

In this part, the user operates the system through its graphical user interface installed on a
laptop. Figure 11 shows the interface provided by the system. The user can preview the correct
posture of salat by clicking the preview button for each posture. After that, the user clicks the
blue start button in the interface for the system to start the inspection. Upon clicking, a video
of the person praying is recorded using the video camera attached to the system.

Two video cameras are needed to inspect the user. One is located in front of the user to inspect
the movements of the front parts of the body, such as the hands during takbiratul ihram. The other
camera is located behind the user to inspect the back parts of the body, such as the legs during
at-tawarrok. The video is recorded until a beep is heard, indicating that the system has finished
capturing the user’s images.




Figure 11. Graphical user interface (GUI) of Intelligent Salat Monitoring System 


Once the system finishes recording the video, it filters the region of interest and converts the
color from RGB to grayscale. The image is then matched against the database using template matching
with Euclidean distance. Grey-level correlation is additionally used to improve the accuracy of the
system.

If the salat is performed correctly within the programmed threshold value shown in the graph, a
message saying “GOOD PERFORMANCE” pops up in green text. Otherwise, the system suggests the correct
posture in red text, indicating that the performance of the salat is bad. This stage applies to
takbiratul ihram, ruku’ and at-tawarrok only. For sujud, a separate graph of the force readings is
shown to the user together with the performance result. In this way, the system trains the user
until the correct posture is performed.

The front camera is used for recording the video during takbiratul ihram, ruku’ and sujud. The
system gives feedback to the user by showing pictures and a graph. Figure 12 depicts a good
takbiratul ihram, and Figure 13 depicts a bad performance of takbiratul ihram. In both figures, the
left side is a screenshot of the recorded video, the middle graph shows the match percentage, and
the notification image is shown on the right side.

The performance of a correct ruku’ is shown in Figure 14. The left side is a screenshot of the
recorded video, the middle graph shows the match percentage, and the notification image is shown on
the right side. Figure 15 presents an incorrect ruku’ performance. From the left, it shows the
template for the incorrect posture stored in the database, the bad performance of ruku’, and the
match percentages, which do not reach the threshold.

If the system finds a frame with a 98% match, a green rectangle appears on the region of interest.
If the match lasts for more than three seconds, the posture is considered correct and a
notification pops up. However, if the match does not last for more than three seconds, the system
notifies the student via a pop-up image and a comment on the image. This is the feedback mechanism
of the salat inspection and training system.
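
The feedback rule described above can be sketched as a check on the per-frame match percentages of
a recording: the posture counts as correct only if the match stays at the 98% level for more than
three seconds without interruption. The function name, the frame rate of 10 frames per second and
the sample recordings are hypothetical; treating a match of exactly 98% as sufficient is also an
assumption.

```python
def posture_is_correct(match_percentages, fps, match_threshold=98.0, hold_seconds=3.0):
    """True if the match stays at or above the threshold for longer than hold_seconds."""
    longest = current = 0
    for percentage in match_percentages:
        # Extend the current run of matching frames, or reset it on a non-matching frame.
        current = current + 1 if percentage >= match_threshold else 0
        longest = max(longest, current)
    return longest / fps > hold_seconds

# Hypothetical recordings at 10 frames per second.
held = [50.0] * 10 + [99.0] * 35 + [60.0] * 5   # 3.5 s above the threshold in one run
flickering = ([99.0] * 20 + [90.0]) * 3         # never more than 2.0 s in a row
```

Requiring an uninterrupted run rather than a total count prevents a posture that is only briefly
reached several times from being accepted.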
Figure 12. Takbiratul ihram: from the left, the recorded video, the match percentage and the
notification

Figure 13. Takbiratul ihram from the front camera, wrong posture performed

Figure 15. Performance of wrong ruku’

All the results apply the same feedback concept except for sujud, for which additional information
is provided. A graph of the force-sensing resistor reading indicates whether the user is performing
the sujud. Whenever the user’s nose and forehead touch the sensor located in the base carpet, the
force triggers the sensor, and the reading is passed to MATLAB. Twelve force readings during sujud
are set as the threshold that indicates a good performance of the sujud. Figure 16 shows the force
sensor readings used to verify the performance of sujud. Figure 16(a) shows how feedback on a good
sujud performance is presented to the user, while Figure 16(b) indicates a bad sujud performance.
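
One possible reading of this rule is that the sujud is rated good once at least 12 force readings
indicating contact have been received from the sensor. The sketch below implements that
interpretation only; the function name, the contact level of 0.01 and the sample readings are
hypothetical and are not taken from the system itself.

```python
def sujud_is_good(force_readings, contact_level=0.01, required_readings=12):
    """True if at least required_readings samples show contact with the sensor."""
    contacts = sum(1 for reading in force_readings if reading > contact_level)
    return contacts >= required_readings

# Hypothetical sensor logs: a sustained contact and a contact that ends too early.
sustained = [0.0] * 3 + [0.02] * 12
too_short = [0.02] * 11 + [0.0] * 5
```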


Figure 16. Force-sensing resistor reading during the sujud showing the performance of (a) good
sujud and (b) bad sujud
5. CONCLUSION

In conclusion, the first objective was to learn the correct postures in salat, for which we
analyzed books of hadith and the works of Muslim scholars in the literature review. The second
objective, to develop an image-processing algorithm using MATLAB, was achieved, as the results show
quite good performance. The third objective was to test the salat performance and provide feedback
to the user. This was also achieved, as the output of the image matching displays a message about
the salat performance and trains the user by giving correct instructions for the current posture of
the salat. The results are quite accurate, as the proposed method is able to identify and match the
pattern with a recognition rate of up to 90% and to inform the user about their salat performance.
The reading in the graph is more accurate when the user performs salat using the system itself
because the camera angle is fixed. Some results show errors when the lighting is bad, even though
the posture is correct, because the pattern matching in MATLAB becomes unreliable when the lighting
is insufficient. For example, the posture of salat may be correct while the system keeps giving bad
performance feedback to the user. This issue can be solved by using the system under sufficient
light, which increases the accuracy of the overall system. A bright room is recommended to ensure
that clear images are captured. The camera angle also needs to be fixed and kept constant between
the database images and the captured correct and wrong images for the system to detect the pattern
without error.